Multimodal analysis of disinformation and misinformation

The use of disinformation and misinformation campaigns in the media has attracted much attention from academics and policy-makers. Multimodal analysis or the analysis of two or more semiotic systems—language, gestures, images, sounds, among others—in their interrelation and interaction is essential to understanding dis-/misinformation efforts because most human communication goes beyond just words. There is a confluence of many disciplines (e.g. computer science, linguistics, political science, communication studies) that are developing methods and analytical models of multimodal communication. This literature review brings research strands from these disciplines together, providing a map of the multi- and interdisciplinary landscape for multimodal analysis of dis-/misinformation. It records the substantial growth starting from the second quarter of 2020—the start of the COVID-19 epidemic in Western Europe—in the number of studies on multimodal dis-/misinformation coming from the field of computer science. The review examines that category of studies in more detail. Finally, the review identifies gaps in multimodal research on dis-/misinformation and suggests ways to bridge these gaps including future cross-disciplinary research directions. Our review provides scholars from different disciplines working on dis-/misinformation with a much needed bird's-eye view of the rapidly emerging research of multimodal dis-/misinformation.


Introduction
As disinformation and misinformation proliferate in everyday society, so does the collective need to fully understand the nature of this threat so that effective counter-dis-/misinformation strategies can be developed.

What is disinformation and misinformation?
We define disinformation in line with the UK's Online Harms White Paper of April 2019 [19], as the deliberate dissemination of information that is false, with the express aim to mislead or obfuscate. Misinformation is similar, but lacks intentionality. Note that disinformation can lead others into misinformation.
It is worth noting that we avoid the use of the term 'Fake News', given its politicized nature. However, a large number of authors refer to 'Fake News' as the topic of their research, which we see as a subcategory of dis-/misinformation.

What is multimodal communication? Why is it important to analyse it?
Most human communication is more than just words: it is multimodal. Humans use visual, verbal and sound modes in order to communicate. What do intonation, facial expression, gesture and body language add to a communicated message? How do people use emojis, images and videos to communicate on social media? How are television or cinema audiences directed, or manipulated, by the producers' choice of timing, settings, camera movement, etc.? How do political entities frame the same event from different angles by foregrounding certain aspects of multimodal communication? Ultimately, how can an understanding of the underlying mechanisms of multimodal communication inform how we live our lives? If we are to understand human communication in its full complexity then we need to answer questions such as these.
Multimodal communication plays an important role in many areas of research from linguistics to political science, from business to computer science. On the one hand, there is a need to develop analytical models and methods for multimodal communication, combined with large multimodal datasets on which these models and methods can be tested. Naturally, this requires an ecosystem suitable for the collection of such datasets, along with pipelines for semi-automatic and automatic annotation. On the other hand, there is a further need to build capacity in the research methods suitable for multimodal communication, and then to deploy this in evidence-based policy settings and other knowledge exchange activities.
It is worth noting that some authors may have different definitions for modality. Though these definitions generally refer to distinct qualities (e.g. treatments in medicine), this paper requires that a mode be relevant to human communication.

What is multimodal analysis of dis-/misinformation? Why is it important? What are the challenges?
Multimodal analysis is the analysis of two or more modalities (language, gestures, images, sounds, among others) in their interrelation and interaction. The study of one modality in isolation overlooks the complexity of communication practices in terms of how textual, aural, linguistic, spatial and visual resources are integrated to create a single discourse or communication unit. This is especially evident in the case of media. Multimodal analysis reveals and interprets the use of several modalities in composing media messages. It assesses how messages are transformed into tools of persuasion and manipulation. The latter is of particular relevance to the study of dis-/misinformation communication. The importance of researching dis-/misinformation in a multimodal fashion and at scale has been established through research on dis-/misinformation at the International Multimodal Communication Centre (IMCC) at the University of Oxford. The ongoing research shows that there is a need to analyse dis-/misinformation not just in the sense of verifiably incorrect information (via fact-checking), but also in the form of certain types of framing of information which aim to mislead or obfuscate less explicitly but more insidiously [20]. Such framings are more often than not achieved through multimodal communication. Multimodal analysis reveals the detailed composition of multimodal media messages (certain combinations of visual, audio and textual information) and their relation to socio-political, cultural and historical contexts. It reveals what makes these messages manipulative. The IMCC research has engaged, among other topics, with multimodal dis-/misinformation communicated by the Russian state and targeting international and domestic audiences (see, e.g. Uhrig et al. [21] on the scaling up of multimodal analysis of RT shows in English).
Consider just one example: the analysis of a Russian TV talk-show 'Pravo Golosa' (2012-2019). The show is broadcast in Russian and is representative of Russian disinformation communicated in covert ways. It discusses domestic and international news and invites guests with alternative (anti-Kremlin) viewpoints for the sole purpose of discrediting them in subtle, clever ways. On the surface, the programme seems to be supporting an exchange of views, arguments and constructive critique. In reality, the anchor (and the whole production crew) uses a wide range of multimodal strategies to ensure that the alternative viewpoints are discredited, but the manner of doing so is almost invisible to the untrained eye. At the same time, disinformation is communicated in an engaging and memorable way.
In one episode from February 2016 discussing Ukrainian politics and the outcomes of Maidan (the Ukrainian 'revolution' of 2014), the anchor relies on frames and conceptual metaphors such as Self-Other, Russia is a Great Power, Ukraine is Sick, and Anarchy and Banditry in Ukraine versus Law and Order in Russia. The construction of discrediting (disinformation) viewpoints is rooted in cultural and historical knowledge shared by Russians and Ukrainians. The anchor employs a range of manipulation techniques grounded in the co-presence of speech and co-speech gestures. For example, he uses deictic gestures to construct a strong overarching message: Ukrainian People Are Self versus Ukrainian Politicians Are Other. He first works hard to present the (purposefully selected) pro-Ukrainian panelists as incompetent, untrustworthy and corrupt. He then encourages the audience to associate their impression of the panel with all Ukrainian politicians and the whole of Ukrainian politics. Every time the anchor talks about Ukrainian authorities and politicians in a subtly discrediting way, he gestures towards the Ukrainian panel (the hand gesture pointing to the 'other'). By contrast, the anchor brings his hand(s) close to his chest (the 'self' gesture normally accompanying words like: I, self, myself, my own, etc.) when talking about the Ukrainian people (on body-directed gestures see [22,23]).
The show also uses co-speech metaphoric gestures as tools of manipulation. These include examples of hand gestures adding crucially important dimensions to the meaning. For example, the purely verbal and relatively neutral 'You changed the [Ukrainian] regime' is transformed via a metaphorical hand gesture made by the anchor into the stronger 'You overthrew the regime' with the implication that 'overthrowing' was illegal/illegitimate. The anchor also speaks with a specific intonation, which adds the epistemic stance of certainty, and makes the statement incorporating the gestural 'overthrow' more categorical. Furthermore, the anchor uses gestures and corporal cues to construct a viewpoint of a strong and powerful Russia versus a weak Ukraine. One example is when the anchor adjusts his posture and moves as though he is preparing for a real physical fist fight. The accompanying hand gestures can be labelled as 'bring it on'.
These conceptualizations (Ukrainian people as SELF versus Ukrainian government/politicians as OTHER, Ukrainian Political Regime is Illegitimate, and Weak Ukraine versus Strong Russia in its 'bring it on' aspect) have continued to be communicated by Russian state media as disinformation until the present time. Those conceptualizations were among the key ones on which Putin relied in his speeches of 21 and 24 February 2022 when justifying the start of Russia's 'special military operation' (the war against Ukraine) and more recently in, e.g., Putin's speech at the ceremony of annexation of Ukrainian regions on 30 September 2022.
Experimental psychologists interpret the use of such co-speech gestures as lowering the cognitive load on the audience and distributing semantic information across language and visual inputs. They also emphasize that once the information conveyed by both language and co-speech gesture has been processed by the viewer, the influence of it cannot be undone. For example, Kelly et al. [24] showed that gestures cannot be ignored, even when people are asked to just make judgements on speech. Gesture-speech integration is 'automatic'. The viewer does not register what parts of the information are conveyed by which mode, and would not think of the work done by a particular gesture as a manipulative technique. The main implication for disinformation communication behind the examples here is that on the linguistic level the information sounds reasonably neutral, yet when combined with co-speech gesture, it is enriched with semantic nuances that make the overall resulting message into successful disinformation. Such manipulation techniques allow Russia (or other hostile states) to make its disinformation covert (more subtle yet powerful) and to avoid accountability for the disinformation it communicates, and make it very difficult for viewers to spot that they are being manipulated or understand how they are being manipulated [23]. The text-only approaches which currently prevail in dis-/misinformation analysis are missing the information communicated multimodally, which makes the results of text-only analyses of dis-/misinformation much less reliable.
The very nature of multimodal analysis necessitates the development and application of multidisciplinary and interdisciplinary approaches, a task which is far from trivial.
Computer scientists have recently become more interested in multimodal analysis too. Correctly dealing with multimodal inputs is of huge importance to the field, particularly machine learning (ML) research. Applications range from sophisticated robotics to disinformation detection. Multimodal analysis in computer science has been buoyed by recent advancements in both hardware and ML techniques.
Although there are studies successfully researching multimodal communication, there are also many missed opportunities stemming from the lack of interdisciplinary approaches. There is also a lack of studies focusing on analysing ecologically valid multimodal data in context and at scale. One meaningful initiative which has attempted to bridge the latter gap is the Red Hen Lab [46]. Bearing in mind the broader situation with multimodal communication research, we engage with research on multimodal dis-/misinformation more specifically.

Research questions
The importance of considering disinformation through a multimodal lens, and its highly interdisciplinary nature, motivate our research questions:

RQ1: To what extent have studies across different disciplines engaged with multimodal analysis of dis-/misinformation? What is the extent of interdisciplinary practices within the field?
RQ2: What methods and data are used by the studies of multimodal dis-/misinformation? What modalities do studies engage with? What value does multimodal analysis add according to the studies? Is multimodal analysis of dis-/misinformation a well-formed research area? What are the challenges this field is facing?
RQ3: What types of multimodal studies of dis-/misinformation add value to the field and in what ways?
In the process of investigating the above, we will present the map of research trends observed within the field. The ultimate goal is not to dictate specifics, but instead to draw the research landscape for multimodal analysis of dis-/misinformation, suggest a future research agenda for the field and inspire best working practices and approaches.
Our review analysis is divided into two stages: publications before the second quarter of 2020 (Stage 1), and publications from the second quarter of 2020 to August 2022 (Stage 2). Stage 2 reflects the rapid increase in the interest in multimodal analysis by computer scientists.

Keyword selection and database searches
The reporting strategy follows the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses; http://www.prisma-statement.org) reporting checklist approach to systematic literature reviews (figure 1) [47].
The current review was interested in papers with three characteristics. First, they had to study disinformation or misinformation. Second, they had to focus on more than one modality (e.g. image and text or sound and image or video and text). Finally, the articles had to focus on traditional broadcast or social media.
Scopus, a database of peer-reviewed articles in a variety of fields including science, technology, medicine, social sciences and arts and humanities, was used to search for records published using the following search criteria: This yielded 980 results on Scopus in April 2020.

Screening
The results were manually screened based on the title, abstract and keywords. Articles were excluded if they were not written in English, were a collection of conference proceedings rather than individual articles, or were review articles. Furthermore, abstracts were excluded if they focused on a single modality or if they were not on data taken from social media, advertising or broadcast media (e.g. television, radio, newspaper). Likewise, articles were only included if they were about events in the period 1900 to the present. Articles studying strategies to reduce misinformation or disinformation like education programmes and public health campaigns were excluded, as were studies on the misinformation effect and memory malleability in the context of witness reliability. Eligible articles were then divided into two categories. The first category contained abstracts that explicitly mentioned the use of multimodal analysis to study the content of dis-/misinformation on social media or broadcast media. The second category included abstracts that looked at dis-/misinformation more broadly (e.g. responses to misinformation) or articles that looked at political propaganda with no explicit evaluation of the veracity of the information contained in the materials studied.

Identifying the field of research
In order to identify what disciplines were represented in our dataset, we used the 'All Science Journal Classification' (ASJC) database published by Scopus, which assigns one or more subject and sub-subject codes to journals in their database. Eighty-nine of the sources in our dataset, primarily conference proceedings and books, had to be labelled manually as they were not in the Scopus database. For visualization purposes, we grouped the sub-subjects into intermediate categories.

Topic modelling
Topic modelling using latent Dirichlet allocation (LDA) was performed on the abstracts. This technique uses word frequencies to identify topics and was used to gain a better understanding of the themes that were being studied. We performed this topic modelling on the full list of eligible articles as defined in the 'Screening' section. The number of topics was chosen using coherence scores.
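As an illustration of how a number of topics might be chosen from coherence scores, the following minimal sketch picks the smallest candidate after which the score stops improving. It is not the procedure used in this review: the function name pick_num_topics, the tolerance min_gain and the example scores are all hypothetical, and the scores are placeholders rather than values from our analysis.

```python
def pick_num_topics(coherence_by_n, min_gain=0.01):
    """Return the smallest topic count after which coherence stops improving.

    coherence_by_n maps a candidate number of topics to its coherence score.
    min_gain is a hypothetical tolerance: a gain below it counts as a plateau.
    """
    candidates = sorted(coherence_by_n)
    for current, following in zip(candidates, candidates[1:]):
        gain = coherence_by_n[following] - coherence_by_n[current]
        if gain < min_gain:
            return current
    # Scores kept rising across the whole range: fall back to the largest N.
    return candidates[-1]


# Placeholder scores for illustration only; they are not the review's values.
example_scores = {2: 0.30, 4: 0.40, 6: 0.45, 8: 0.45, 10: 0.43}
chosen = pick_num_topics(example_scores)
print(chosen)
```

In practice the coherence scores would come from fitting one LDA model per candidate number of topics; a rule of this kind only formalizes the visual inspection of the resulting curve.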

Qualitative analysis
A total of 101 full-text articles were analysed qualitatively. We assessed these based on five criteria, namely: (i) whether they were written in English (irrespective of whether they had an English-language abstract); (ii) whether the full text was available online; (iii) whether they were grounded in original research; (iv) whether they focused on more than one modality; (v) whether they focused on misinformation, disinformation, 'fake news' or propaganda. Forty-nine of the 101 articles met these criteria and as a result were included in the final in-depth qualitative analysis.
The full text of all papers was reviewed qualitatively and information about each was added into an extraction table covering the following points: (i) bibliographical information; (ii) data used (source of the data, how they were obtained, how dis-/misinformation was identified); (iii) modalities studied; (iv) methodology used; (v) working definition of dis-/misinformation; (vi) main findings; (vii) value of multimodal analysis according to the authors; (viii) ethical or social challenges raised by authors.

Co-citation network analysis
We created a co-citation network of the 49 articles included in the full-text qualitative analysis. The reference list for each article was obtained from its Scopus entry, and the journal names were extracted using regular expressions. Each journal was a node in the network, and edges were drawn between journals that cited each other. Network analysis was carried out using networkx and community. Community detection was carried out using the Louvain algorithm with a minimum community size of 10. In addition to looking at citations between journals, we also obtained what subjects were citing each other, using the Scopus database of journals. Of the 2043 journals and publication venues, only 1616 were in the Scopus database, as books and conference proceedings were not in the database.
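The edge-building step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in this review: the reference layout, the regular expression JOURNAL_AFTER_YEAR, the helper names and the example records are all hypothetical, and real reference strings would need a more robust pattern.

```python
import re
from collections import Counter

# Assumed reference layout (hypothetical):
# "Authors, Title (2015) Journal Name, 10 (2), pp. 1-20."
JOURNAL_AFTER_YEAR = re.compile(r"\(\d{4}\)\s+([^,]+),")


def extract_journal(reference):
    """Return the journal name following the year, or None if not found."""
    match = JOURNAL_AFTER_YEAR.search(reference)
    return match.group(1).strip() if match else None


def build_journal_edges(articles):
    """Count citation links between journals.

    articles is an iterable of (citing_journal, list_of_reference_strings).
    Edges are undirected and keyed by the sorted pair of journal names;
    self-citations within one journal are skipped (a design assumption).
    """
    edges = Counter()
    for citing_journal, references in articles:
        for reference in references:
            cited = extract_journal(reference)
            if cited is not None and cited != citing_journal:
                edges[tuple(sorted((citing_journal, cited)))] += 1
    return edges


# Made-up records for illustration only.
example_articles = [
    ("Journal A", [
        "Doe, J., Some title (2015) Journal B, 10 (2), pp. 1-20.",
        "Roe, R., Another title (2018) Journal C, 3, pp. 5-9.",
        "Unparseable reference",
    ]),
    ("Journal B", [
        "Poe, E., Third title (2012) Journal A, 7 (1), pp. 30-41.",
    ]),
]
edges = build_journal_edges(example_articles)
print(edges)
```

A weighted edge list of this form could then be loaded into a graph library for community detection; references that do not match the assumed layout are simply dropped here, which is one reason manual checks remain necessary.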

Stage 1: Results
The eligible records were identified using the title, abstract and keywords. In total, 303 articles focused on more than one modality, focused on dis-/misinformation and used data collected from social media or broadcast media. A subset of these records (n = 101) focused specifically on analysing the content of dis-/misinformation. Most of the analysis presented here focuses on the entire set of eligible records, but some comparisons are drawn between the eligible set and the content-specific subset.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 230964
As expected, the number of studies focusing on dis-/misinformation has increased in the last decade (figure 2a). This increase appears to happen in two distinct phases. The first increase from 2008 to 2016 is probably due to the increasing popularity of online media including social media platforms like Twitter, Facebook and Instagram as well as online platforms like YouTube. The second phase is much steeper and started from 2016 onward, and is probably due to the increasing sensationalization of 'fake news' and online misinformation. The apparent dip in 2020 is due to the fact that the data collected for Stage 1 only includes the first quarter of that year.
Most of the eligible records were published as research articles in journals as well as part of conference proceedings (figure 2b). There was a total of 703 unique authors, but most of these authors appeared in only one publication. Only three authors appeared in three records and three authors in four records (figure 3a). Likewise, there were 241 unique publication venues, but very few of them appeared more than once (figure 3b). Both of these results suggest that multimodal analysis on dis-/misinformation in media is not concentrated in a select number of established research communities, but rather publications are spread out across many journals and conferences, and few researchers have (yet) done multiple studies on this topic.
The absence of journals with a large number of records or authors that published a lot in the area suggested that multimodal analysis of dis-/misinformation in media does not have a single research community. This suggested that the eligible sources could belong to a diverse set of disciplines. The 241 sources were cross-referenced with Scopus' source database that has the primary subjects and sub-subjects published in journals. For conference proceedings that were not in the database, the subjects and sub-subjects were hand-coded. Most of the records were published in the social sciences as well as the arts and humanities, but there were a lot of other fields represented including computer science and life sciences (figure 4a). Furthermore, for the records in the social sciences, there was also a diverse set of sub-subjects with the primary areas being communications and political science (figure 4b). While multimodality is a 'hot topic' in computer science (especially within natural language processing and computer vision), our analysis found that within Stage 1 few computer science publications on multimodality were specifically about dis-/misinformation.
Finally, in addition to understanding the disciplinary contributions, the methodology used for the multimodal analysis was studied. This analysis focused only on the subset of records with a focus on analysing dis-/misinformation multimodal content. The strategy for multimodal analysis was determined to be qualitative (e.g. discourse analysis) or quantitative (e.g. deep learning) or both. The breakdown of subject areas for this subset was similar to the breakdown of subject areas of the full set of eligible records (figure 5a). Even though social sciences and humanities were the primary subject areas represented, more than half of the articles used quantitative methods (figure 5b). This suggests that several of the studies in the social sciences used quantitative methods in addition to the records published in engineering science or computer science journals. Nonetheless, there are still very few papers that attempt to combine both qualitative and quantitative approaches to multimodal analysis.

Topic modelling
Topic modelling was then applied to understand what topics studies on dis-/misinformation in online or broadcast media were focused on. The most common words in the dataset after removal of the most common words in English included many of the search terms used in Scopus like 'news' and 'propaganda' (figure 6). However, there were words like 'political', 'war' and 'state' that suggest that a lot of the research focuses on dis-/misinformation in the context of political events or conflicts.
Our study trained an LDA topic model on the abstracts of the eligible records. LDA is an established method in topic modelling that uses the frequency of words in each abstract to iteratively assign words to topics and topics to abstracts. In order to determine the number of topics to use, the coherence score was used, which is a measure of the quality of the topics that were found. Typically, the coherence score plateaus or falls after reaching a good number of topics. Based on the scores (figure 7), the rest of the analysis was carried out for N = 4 topics. Table 1 lists the topics detected by the LDA algorithm for N = 4 topics, with the top words assigned to each topic and a qualitative interpretation of what each topic may correspond to. Three of the topics correspond to types of misinformation, where Topic 1 corresponded specifically to online misinformation and Topics 0 and 3 corresponded to propaganda for political and religious purposes. Topic 2 appeared to have a mix of different themes including gender, war and religion.
Each abstract will contain words that belong to each topic, so each abstract can be assigned one or more topics that mostly represent it. The predominant topic for each abstract was obtained, and the three topics related to misinformation were similarly represented in the abstracts (figure 8a). Topic 2 was the main topic for only 18 abstracts, but appeared as a topic in 39 abstracts, which may suggest that this topic contains words that refer to multiple aspects of the dis-/misinformation content being studied. This is confirmed by looking at the frequency with which topics co-occur with each other normalized to their relative frequency. Topic 2 is the only one that seems to co-occur with the other topics (figure 8b). Among the three dis-/misinformation topics, the two propaganda-related topics (Topics 0 and 3) had a slightly higher co-occurrence compared to the overlap between Topic 1 and either Topic 0 or Topic 3.
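The two quantities discussed above, the predominant topic per abstract and the normalized co-occurrence of topics, can be sketched as follows. This is an illustrative sketch only: the presence threshold of 0.2, the normalization (observed co-occurrence divided by the count expected if topics were independent) and the example weights are assumptions made for the illustration, not the settings or data of this review.

```python
from collections import Counter
from itertools import combinations


def summarise_topics(doc_topic_weights, presence_threshold=0.2):
    """Summarise per-abstract topic weights.

    doc_topic_weights holds one list of topic weights per abstract.
    presence_threshold (hypothetical) decides when a topic counts as present.
    Returns counts of predominant topics, counts of present topics, and a
    co-occurrence ratio per topic pair (observed / expected if independent).
    """
    n_docs = len(doc_topic_weights)
    predominant = Counter()
    present = Counter()
    pair_counts = Counter()
    for weights in doc_topic_weights:
        # Predominant topic: index of the largest weight.
        top = max(range(len(weights)), key=weights.__getitem__)
        predominant[top] += 1
        topics_here = [t for t, w in enumerate(weights) if w >= presence_threshold]
        present.update(topics_here)
        pair_counts.update(combinations(topics_here, 2))
    ratio = {
        pair: count * n_docs / (present[pair[0]] * present[pair[1]])
        for pair, count in pair_counts.items()
    }
    return predominant, present, ratio


# Made-up weights for four abstracts and four topics (illustration only).
example_weights = [
    [0.7, 0.1, 0.2, 0.0],
    [0.1, 0.6, 0.3, 0.0],
    [0.0, 0.1, 0.1, 0.8],
    [0.6, 0.0, 0.0, 0.4],
]
predominant, present, ratio = summarise_topics(example_weights)
print(predominant, present, ratio)
```

A ratio above 1 indicates that two topics appear together more often than their individual frequencies would suggest; a topic can be present in many abstracts while being predominant in few, which is the pattern reported for Topic 2.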
We looked at the distribution of different disciplines within each topic (figure 9).There is a clear disciplinary divide in the vocabulary employed to refer to dis-/misinformation, because while articles published in social sciences and humanities journals spoke about it in terms of words related to

Qualitative analysis
Next, a more in-depth qualitative analysis was carried out on the full text of 49 articles. We engaged with the parameters for the analysis outlined in §2.6 above, which are translated here in the subheadings in this section of our paper. Our engagement produced the following results.

Methodology employed by articles
Methods. Most studies used either quantitative (n = 20) or qualitative (n = 22) methods, but around 14% of them used a mixture of both (n = 7).
Data. Apart from the types of methods used, there was a marked difference in the data sources employed by quantitative papers as opposed to qualitative or mixed methods papers. Over half of the quantitative papers (n = 12) used publicly available datasets in their analysis, whereas the qualitative and mixed method papers all collected data specific to their proposed research questions. Two datasets that were used by multiple papers include the Weibo dataset [48] and the MediaEval Twitter dataset [49]. Both of these are freely available and individual items have been labelled as rumour/non-rumour or fake/true, respectively.
Combinations of modalities. We found most articles focused on two modalities rather than three (noting that articles analysing only one mode are not within the scope of our review). Forty articles focused on the analysis of two modalities. Of these, 38 focused on image and text [20,48,. Two articles focused on sound and image [86,87].

Qualitative papers: theoretical frameworks and methods used
Our meta-analysis revealed quantitative methods were used by 20 articles while qualitative methods were used by 22 articles; seven papers additionally used a combination of quantitative and qualitative analyses. We used qualitative analysis to establish which theories and methods were employed by 22 studies that researched dis-/misinformation through qualitative methods. Our meta-analysis revealed:
(I) Eleven papers used a form of discourse analysis as their core method as rooted in linguistics, rhetoric and/or social semiotics, while combining their discourse analysis with a number of other theories and methods key to their analysis of content. Of those: -
(IV) One paper [84] included network and spatial distribution analyses to investigate how disinformation spread.
(V) One paper [95] concerned the development of software to perform fact-checking.
We note that except for three papers in (II), all the publications above originated from the field of computer science. The majority of quantitative papers tried to automate the detection of false news by the training of ML algorithms. On the other hand, only one paper attempted to study the actual properties of disinformation using ML. This disparity is indicative of a wider issue. It indicates that advanced quantitative methods to detect multimodal disinformation are being prioritized over investigating multimodal features of disinformation; there is no historical corpus to suggest the latter is a solved problem. At the same time, the narrow spread of research highlights the possibility of missed opportunities to answer broader research questions, particularly those outside computer science. An interdisciplinary approach may avoid this.
Though these quantitative papers evidently engaged with the topic of disinformation, rarely did they investigate the roles that multiple modalities play. This means that, despite the clear interest from computer science, it is predominantly other disciplines that drive forward the theoretical understanding of multimodal disinformation.

Instances where papers employed both qualitative and quantitative methods
Incorporating both quantitative and qualitative methods probably indicates the development of pre-existing manual approaches. Seven papers were found to do both; these can be considered to fall within three broad categories:
(I) Expanding qualitative analyses with quantitative approaches: one paper [65] used the approach of multimodal discourse analysis by C. Jewitt to annotate a subset of the dataset studied, then used a software-based multimodal analyser to expand this analysis across the dataset; one paper, using systemic functional multimodal discourse analysis (SF-MDA) based on work by O'Halloran, followed a similar approach with the same software [78]; one paper first had health experts manually annotate videos [88] and, after extracting a range of video features (such as a transcript or acoustic features), constructed an ML model to detect health misinformation within videos.
(II) Attempting to clarify quantitative results by subsequently applying qualitative analysis: after automatic coding of URLs was used to determine the types of media shared on WhatsApp, one paper [96] manually reviewed a randomized subset of the collected data to provide a more fine-grained understanding of the URL content; one paper that created a multiclass classification network examined a collection of the model's outputs, from which the authors could determine what categories of material their model was able to better recognize [69].
(III) A collection of disparate methods used as part of a single topic of investigation: one paper manually sought to descriptively categorize the types of shared content, before engaging in a network analysis of the spread of such content [51]; one psychology paper employed a range of statistical analyses on its experimental results, followed by qualitative insights gained from subject interviews [61].
While six papers above had computer scientist authors, three of these papers had interdisciplinary authorship. In the context of all the computer science papers we analysed, interdisciplinary authorship accounted for only around 15% of the work. Two of these papers included arts and humanities authors, while the third featured contributions from the medical sciences. This interdisciplinarity unlocked the possibility of new work.

Definitions of dis-/misinformation
All of the articles analysed looked at an aspect of how information can be manipulated or distorted in online or broadcast media in the form of misinformation, disinformation, or more broadly propaganda or bias. More than half of the articles (n = 33) assumed the definition of misinformation implicitly and did not provide a definition. Only five articles provided a definition of disinformation and four articles provided a definition of misinformation, while eight articles provided a definition of fake news.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 230964
Of the articles that provided a definition of misinformation and disinformation, only one article [64] used the definition as used in this paper. While all of the variations of the definition of disinformation alluded to the intentionality of the agents spreading false information, the definitions of misinformation were more variable. Unlike the definition adopted by this report, none of them noted the unintentional nature of misinformation. This suggests little progress in the field of understanding the mechanisms and subtleties of (multimodal) misinformation.
For papers that consider detection of dis-/misinformation, it is crucial to define clearly and fully the problem being investigated. The provision of clear and full definitions of dis-/misinformation is an issue the research field needs to address going forward.

Value of multimodal analysis according to the authors of studies
Six out of 49 articles did not discuss the value of multimodal analysis. The articles which addressed this question did so with varying degrees of explicitness. Those which were more explicit noted that multimodal analyses add value in that, for example: a multimodal approach which considers the functions of language and images and/or videos together has the potential to shed further light on the construction and impact of propaganda [78], with the visual modality typically being key for communicating across countries and cultures [82]; visual input changes the way a person is perceived [20]; ignoring visual content on social media loses a lot of the content [80]; the text, sound, dance, visuals and context interact as a unit to convey multiple layers of meaning [94]; multimodal considerations of multimodal data boost detection rates when compared to unimodal approaches [48].

The overall findings of the published studies
The articles differed in terms of how clearly they presented their findings and achievements and also in terms of the domains within which those findings and achievements fell, namely: (i) the development of theoretical and methodological approaches for multimodal analysis of dis-/misinformation, including but not limited to the detection of fakes; (ii) the creation of new multimodal datasets; (iii) insights from qualitative and/or quantitative analyses of multimodal data (e.g. discourses and media contents) of a certain kind. The broad variety of theories, methodologies and types of data used by the studies considered prevented us from going deeper into the analysis of their findings and achievements here, but the majority of authors emphasized the importance of analysing visual and sound data in addition to textual data to improve the accuracy and validity of dis-/misinformation and propaganda research, including that on the detection of fakes. At the same time the authors commented on the challenges which such multimodal research presents. We engage in a deeper analysis of the findings of a subcategory of the studies reviewed, those which mainly originated from computer science, below in Section 'Stage 2: Zooming into multimodal disinformation and misinformation publications after March 2020'.

Ethical or social challenges raised by the authors of studies
Out of 49 articles, only one explicitly engaged with an ethical challenge [73]. The paper, having observed and followed 'far-right and neo-facist' social media posts, touched upon the ethical issues of researching violent and extremist content. It also discussed how researchers can be protected and whether to reveal the identities of people calling for violence in the context of the wider issue of the invasion of privacy.
For the purposes of their study, the author chose to anonymize all the published results.
Research on multimodal dis-/misinformation ought to engage better, and more explicitly, with ethical and social challenges. Such engagement ultimately helps to inform government policies, among other social benefits, and needs to form part of the research field.

Co-citation networks
The network community detection is good at recovering groupings that we had identified based on the methods and research questions of the papers. The co-citation analysis and visualization were performed on journals that were cited a minimum of 5 times.
Then we looked at what fields were citing each other (results shown in table 2). The subjects are taken from Scopus. With the exception of journals in life sciences (which in our dataset were journals in cognitive neuroscience), journals in one discipline predominantly cited articles published in journals in the same discipline. For example, of the citations within social science and humanities journals are to other journals in the social sciences and humanities. Overall, social science journals made up at least 25% of all citations regardless of the discipline of the citing journal.

Stage 2: zooming into multimodal disinformation and misinformation publications after March 2020
The growth of papers falling within our search parameters has an exponential profile, as shown in figure 2. Hence, individually sifting through each paper would become unfeasible. Moreover, as COVID-19 has shaped many research interests, we felt it was appropriate to set March 2020 as a threshold to more narrowly consider only publications that explicitly aimed to research multimodal dis-/misinformation. To do this, the steps of §2 were repeated in August 2022, except that we additionally filtered out any papers that did not contain the (case-insensitive) key words {MULTIMODAL, MULTI-MODAL} in either the title or abstract. This reduced the list to n_0 = 133. Similar to Stage 1, papers that were not directly related to multimodal dis-/misinformation were manually excluded: this gave a final n_f = 78. We report on our observations relevant to advancing the field, highlighting best practices.
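The key-word filter described above can be illustrated with a minimal sketch. It assumes each record is a dictionary with hypothetical 'title' and 'abstract' fields; the regular expression, the field names and the function name are illustrative assumptions rather than the tooling actually used for the review.

```python
import re

# Matches "multimodal" or "multi-modal", ignoring case (assumed pattern).
KEYWORD = re.compile(r"multi-?modal", re.IGNORECASE)

def mentions_multimodal(record):
    """Return True if the key word occurs in the title or the abstract."""
    # Missing or empty fields are treated as empty strings.
    fields = (record.get("title") or "", record.get("abstract") or "")
    return any(KEYWORD.search(text) for text in fields)

# Hypothetical records, for illustration only.
records = [
    {"title": "Multi-Modal fake news detection", "abstract": "..."},
    {"title": "Fake news on Twitter", "abstract": "A text-only study."},
]
kept = [r for r in records if mentions_multimodal(r)]
```

Records passing such a filter would still need the manual relevance screening described above.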

The rise of computer science since second quarter of 2020
From 2020 onward, computer science (CS) clearly became the significant majority of all these publications considered (figure 10); in fact it largely accounts for the overall growth in publications. Evidently this particular shift to CS merits further analysis. Out of the n_f = 78, n_cs = 73 had CS authors, from a range of international institutions (table 3). This section aims to motivate the direction of future work.
Automatic detection of multimodal dis-/misinformation with machine learning
Advances in ML algorithms, coupled with decreased costs of access to computing hardware, have led many more computer scientists to apply ML to new tasks. As such, the stated purpose of an overwhelming 90.4% of papers within this Stage 2 review set was to propose a new method for the detection of multimodal dis-/misinformation. Applying ML to dis-/misinformation has garnered such interest that two new workshops in this area emerged within our review set. A number of papers were from the workshop 'De-Factify', concerning multimodal fact-checking and hate-speech detection, and another paper was submitted as part of the workshop 'MAD2022', which focused on multimedia disinformation.
Usefully combining multiple distinct inputs for use in neural networks is challenging. This is because the various inputs can be weak (uninformative), or may have a strong interdependence. As multimodal disinformation varies widely, so does the salience of each modality, or their interactions. Consequently, Chen et al. [99] and Song et al. [100] show that including multiple modalities, without additional handling, can increase detection noise. This suggests that dynamically altering the importance of each modality is crucial. Figure 11 depicts a generalized multimodal disinformation detector; a dynamic implementation is able to vary the weighting between modules 1-3 depending on the input data. If the weightings are frozen after training, this naïvely assumes that there is minimal distributional difference between the training data and the real world, which is often unwarranted [101]. A total of 8 papers did this. Unfortunately, a further 23 papers (31.5%) in our Stage 2 review set had either a weighting of 0 for module 3's connections, or performed 'late fusion' of single-modality classifier outputs. Neither of these methods is sufficient to infer the general properties of disinformation, which can depend strongly on the multimodal interactions.
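To make the distinction concrete, the following minimal sketch contrasts late fusion of unimodal outputs with an input-dependent ('dynamic') weighting of the three modules in figure 11. It is not taken from any reviewed paper: the function names, the softmax gate and the parameter layout are assumptions chosen for illustration, and in practice the gate parameters would be learned during training.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(p_text, p_image):
    """Average two unimodal probabilities; no cross-modal interaction is used."""
    return 0.5 * (p_text + p_image)

def gated_fusion(text_feat, image_feat, cross_feat, gate_params):
    """Weight the outputs of modules 1-3 with gates that depend on the input.

    All three feature vectors are assumed to have the same length d.
    gate_params holds three hypothetical (weights, bias) pairs, one per
    module, where each weights list has length 3 * d.
    """
    feats = [text_feat, image_feat, cross_feat]
    joint = text_feat + image_feat + cross_feat  # list concatenation
    scores = [sum(w * x for w, x in zip(weights, joint)) + bias
              for weights, bias in gate_params]
    gates = softmax(scores)  # one non-negative weight per module, summing to 1
    fused = [sum(g * f[i] for g, f in zip(gates, feats))
             for i in range(len(text_feat))]
    return gates, fused
```

Because the gates are recomputed for every input, the relative weight of the cross-modal module can change from one post to the next, whereas late fusion never sees the interaction between modalities at all.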
Finding more suitable misclassification penalties (i.e. 'loss functions') can itself prove effective. Three papers [102][103][104] attempted to address the heterogeneity of the disinformation landscape by enforcing orthogonality between distinct news events. This was done by employing an 'adversarial approach' to the loss function during training. The competition between the generator and classifier networks helped to capture differences between domain-specific and domain-independent features. This approach requires defining, or categorizing, these domains a priori.

Attempts to analyse the models and their decisions
Most papers engaged with our question of unimodality versus multimodality by disabling a modality input, a process referred to as 'ablation'. However, these tended to show only small improvements from the inclusion of images. This may in part be explained by the strengths of natural language processing algorithms. Some authors additionally performed qualitative analysis of their ablation, but often the examples found, where a multimodal paradigm prevailed, had minimal linguistic rationale. We point to Wu et al. [105] as a clear example where the authors' model had captured modality-specific cues.
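An ablation of this kind can be sketched as follows. The example is a toy under stated assumptions: `toy_predict` is a hypothetical stand-in for a trained detector, the four samples are invented, and zeroing a feature vector is only one of several possible ways to disable a modality.

```python
def accuracy(predict, samples, ablate=None):
    """Accuracy of `predict`, optionally with one modality zeroed out."""
    correct = 0
    for text, image, label in samples:
        if ablate == "text":
            text = [0.0] * len(text)
        elif ablate == "image":
            image = [0.0] * len(image)
        correct += int(predict(text, image) == label)
    return correct / len(samples)

def toy_predict(text, image):
    """Hypothetical classifier that relies mostly on the text features."""
    score = 0.8 * sum(text) + 0.2 * sum(image)
    return int(score > 0.5)

# Invented (text features, image features, label) triples.
samples = [
    ([1.0], [0.0], 1),
    ([0.0], [1.0], 0),
    ([0.5], [1.0], 1),
    ([0.0], [0.0], 0),
]

full = accuracy(toy_predict, samples)
no_image = accuracy(toy_predict, samples, ablate="image")
no_text = accuracy(toy_predict, samples, ablate="text")
```

Comparing `full` with `no_image` and `no_text` gives the per-modality contribution; in this toy setting removing the image costs less accuracy than removing the text, mirroring the small image gains reported above.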
Understanding why models respond in the way they do is one of the goals of developing 'explainable AI' or 'XAI'. Given the potential complexity of multimodal disinformation, if such detectors are to be seriously considered in the real world, incorporating XAI methods may become a key requirement. The three main approaches we encountered were:
(I) Direct analysis of the detection model: only one paper [106] explicitly did this, which provided insight into what aspects of the data their model was reacting to.
(II) Examination of the properties of the dataset(s): this was mainly constrained to dataset creation papers (see next section), though estimates for biases are not always performed; the models may be configured to present data statistics, for instance the levels of inter-modality interaction [99], the discordance of each modality [107], or the model's modality weightings [108].
(III) Visualizing the decision: visualizing data and the model can offer insight to researchers, but aside from [109] this was rarely attempted.
Particularly for (II) and (III), domain experts can be invaluable for finding clues and patterns; this again suggests an interdisciplinary approach may provide new insights. Lastly, we note that for (II), there has been no work systematically exploring biases (and whether these biases may themselves be multimodal) and their effects on disinformation detection.

'Fake News': definitions, prevalence and classification consequences
As seen in the papers in Stage 1, key words relating to disinformation are often not defined: only 37% of this subset provided definitions. This may be partly explained by widespread adoption of the term 'Fake News', which is synonymous with binary classification, the dominant paradigm. By contrast, only seven papers (9.6%) acted to classify in at least three ways [102,110-116]; definitions were introduced to justify their classification objectives. A general disinformation detector cannot fall within the scope of a binary classifier. It is still possible for a binary classifier to be able to accurately categorize a subset of disinformation; defining this type of disinformation is then a practical necessity.

Multimodal data, its properties and how it was used
There were five non-CS papers. Of these, full video featured heavily; two papers applied multimodal analysis [113,117] to investigate the video content and messaging, whereas one psychology paper investigated the effects of flagging deepfakes on subjects [118]. On the other hand, only a couple of CS papers went beyond two modalities: one focused on examining sequences of images and their captions from YouTube [119] and one paper on TikTok went further still by also incorporating audio [120]. Some papers stated metadata as an additional modality, but as justified in §1.2 this would not be counted in this paper. The sources of data used by papers were not extensive; the levels of data-source mixing are depicted in figure 12a.
We saw a number of detection papers expressing regret at the lack of accessible datasets. The stated aims of seven papers were to add to the range of available training data. The main challenge is labelling large quantities of data. For datasets focusing on collating online news articles, two papers chose to label by quantifying the publishers' 'credibility' based on results from fact-checking organizations [121,122]. This is problematic, though, as disinformation is complex and can be intermingled within legitimate news and claims; see, for example, [123]. Using automatic claim matching, two papers sought to create multilingual datasets [116,124]. One paper manually verified samples of the data studied, which exclusively focused on Reddit posts [111]. Another approach is to augment [125] or synthesize data [126]. Only one paper considered the threat of the dataset studied, filtering out results that were too straightforward by using an adversarial approach [126]. Moreover, the authors tested their dataset on real humans as a benchmark (an important step, only otherwise fulfilled in [112]), finding both that humans struggled to distinguish between fake and real examples and that their detector performed at a comparable level. It should be noted that while many papers conducted analyses of the datasets' textual content, only two papers [110,127] additionally collected statistics on their images.
Only two papers attempted to test their detection models on live unseen data. For instance, Wang et al. [128] scraped and manually labelled COVID-related Instagram posts once a day for a month, and obtained very similar classification results to their initial offline run.
The early detection of disinformation is vital in limiting its harm, but unseen disinformation can prove challenging for detection (zero-shot classification). If instead a few initial examples within a nascent category are allowed to be manually labelled, this can allow for 'few-shot classification'. To this end, one paper presented a meta-learning approach [129], which aimed to jointly learn category and global features as they arose. Similarly, one paper held back training examples based on their post time to mimic real-life conditions [130], testing their model's performance for different delay times. Overall, we found limited engagement with temporal features of disinformation. In particular, no papers considered the longer term evolution of disinformation, nor presented analysis on any distribution shifts.
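A delay-based hold-out of the kind described above could look like the following sketch. It is an assumption-laden illustration rather than the procedure of [130]: the post structure, the `split_by_delay` helper and the chosen delays are all hypothetical.

```python
from datetime import datetime, timedelta

def split_by_delay(posts, first_seen, delay):
    """Train on posts available up to `delay` after the first post; test on later ones."""
    cutoff = first_seen + delay
    train = [p for p in posts if p["time"] <= cutoff]
    test = [p for p in posts if p["time"] > cutoff]
    return train, test

# Invented posts about one event, 0, 1, 5 and 30 hours after it first appeared.
start = datetime(2022, 1, 1, 0, 0)
posts = [{"id": i, "time": start + timedelta(hours=h)}
         for i, h in enumerate([0, 1, 5, 30])]

# Evaluate how much training data is available for different detection delays.
for hours in (2, 24):
    train, test = split_by_delay(posts, start, timedelta(hours=hours))
    print(hours, len(train), len(test))
```

Splitting on post time rather than at random prevents later posts from leaking into training, which is what makes the evaluation resemble early detection.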

Interdisciplinary and non-computer science publications
The stricter corpus requirements left few papers outside CS. Though the sample size is small, the remaining non-CS papers deviated notably neither in approach nor in quality from the papers studied in Stage 1. In one paper [127] theories of communication both informed the quantitative approach and yielded insight into the statistical features of their dataset. Two CS papers [110,112] stood out for their use of interdisciplinary methods. These papers both studied memes, which are inherently multimodal objects, and drew from methods outside CS for data annotation and analysis, as well as to inform their use of quantitative methods.

Figure 12. Shown here is where the data originates and how these sources mix. The breakdowns were chosen to highlight the sorts of disinformation studied. A source was counted if the paper's data source was distinct (e.g. datasets 'X' and 'Y' count as two). We counted 91 instances of sources, hence an average mixing of 1.25 sources per paper (largely independent). (a) Depicting the mixing of data sources and (b) breakdown of 'Miscellaneous' in (a).

Discussion and conclusion
Our meta-study has shed some light on how research on multimodal dis-/misinformation in media communication has been evolving in the past 20 years. Our division of the meta-analysis into two stages, before and after the second quarter of 2020, has reflected changes in the development of the research area due to the start of the COVID pandemic in the second quarter of 2020, namely the prevalence of multimodal research of dis-/misinformation originating from computer science from 2020 onwards.
Our meta-analysis in Stage 1 identified 303 articles researching dis-/misinformation in media while also focusing on more than one modality. Of those 303, 101 performed multimodal analysis of the content of dis-/misinformation. Our further in-depth qualitative analysis focused on full texts of 49 articles that met our inclusion criteria.
Our meta-analysis has revealed that there is no single disciplinary or cross-disciplinary community employing multimodal analysis to study dis-/misinformation. Most authors and publication venues appeared only once in our dataset, which suggested that multimodal analysis of dis-/misinformation was not the main subject of study of any given researcher or research community. The diversity of disciplines from which articles originated (from the social sciences and computer science to management, engineering and health sciences) further pointed to the lack of an established research community with the focus on multimodal dis-/misinformation. Our topic modelling analysis of the abstracts revealed a disciplinary split. Abstracts from articles published in the social sciences and humanities scored highly on the propaganda topics and low on the online media topic, whereas those published in the physical sciences (including computer science) scored highly on the online media topic but not on the propaganda topics. This dissociation suggested that a barrier to establishing a cross-disciplinary research community was a lack of common focus, terminology and definition of dis-/misinformation.
Our in-depth analysis of 49 full-text articles revealed a further binary opposition, this time related to the research focus and method used: articles which employed quantitative methods were primarily interested in creating frameworks to detect whether a particular news item or social media post was 'fake news' or not. By contrast, articles using qualitative methods mainly focused on propaganda analysis, as well as multimodal strategies of persuasion and manipulation in media discourse and communication. Only 7 out of 49 articles used a mix of qualitative and quantitative methods. Those originated from disciplines which would traditionally use a qualitative approach to analysis. They used quantitative methods to scale up and further support qualitative analysis.
The majority of studies engaged with the question of the added value of multimodal analysis as well as of the challenges of developing and applying theories and methods suitable for multimodal analysis of dis-/misinformation. For many papers these engagements constituted part of the studies' findings.
Our analysis demonstrated that only one paper out of 49 engaged explicitly with the question of ethical and social challenges of multimodal research on dis-/misinformation. We argue that these challenges need to be addressed by the field better in the future.
Furthermore, more than half of those 49 articles did not contain definitions of misinformation, disinformation, fake news or propaganda. If present, the definitions varied across studies. This suggested the lack of uniform understanding of the objects of study to which those terms refer.
Our analysis in Stage 1 revealed that research on multimodal dis-/misinformation would benefit from the development of one established area, with clear definitions of research objects, a goal to address ethical and social challenges, a unified terminology and cross-disciplinary methodological practices. Cross-disciplinary practices would benefit not only disciplines that traditionally use qualitative methods, but also those which would traditionally rely on quantitative methods.
Although we observed a general increase in studies focusing on multimodal dis-/misinformation in 2008 and 2016, it is only from the second quarter of 2020 that we observed a rapid increase in computer science studies on multimodal dis-/misinformation.
Our meta-analysis of Stage 2 demonstrated that there was no notable change in the research on multimodal dis-/misinformation which originated from disciplines other than computer science. It is computer science studies which became the driver for the explosive growth of research on multimodal dis-/misinformation. That growth came with scholars across many cultures engaging in multimodal dis-/misinformation research.
Those changes motivated the emerging need to understand the extent to which the choices in ML techniques, more specifically, were informed by knowledge originating from humanities and social sciences. The in-depth examination of full-text articles at Stage 2 revealed positive dynamics in how the computer science studies under consideration addressed the complexities associated with the analysis of multiple modalities. However, while in aggregate these papers employed a range of multimodal strategies, no single paper brought these together. The missed opportunities were driven to a considerable extent by the lack of an interdisciplinary approach to analysis. We also identified clear research gaps, such as the limited work on the temporal nature of multimodal dis-/misinformation.
Our meta-study at Stages 1 and 2 demonstrated that most studies, regardless of discipline, focused on two modalities rather than three. This may be explained by scholars' intention to keep analysis more straightforward, but also by the use of pre-prepared data. Especially within computer science, most studies used existing datasets rather than constructing their own.
For articles analysed in both Stages 1 and 2, we observed no engagement with questions about how dis-/misinformation evolves over time, including shifts in distribution patterns. We consider that this is at least partly due to a lack of studies which employ true interdisciplinary approaches to investigation of dis-/misinformation.
Nonetheless, single-discipline papers still brought value to the overall study of multimodal dis-/misinformation. In addition to moving beyond text-only approaches, papers that provided definitions of dis-/misinformation, those constructing novel datasets (especially video) and those which used a combination of qualitative and quantitative methods were particularly valuable.
Our meta-analysis has revealed the potential for computer science techniques to aid theories of multimodal dis-/misinformation communication originating from the range of disciplines in humanities and social sciences, and to scale up qualitative analysis to provide statistical validity. Explainable AI could be a large help in this regard, especially if developed with social science and humanities expertise. More interaction across the humanities and social sciences with computer science could enable further development of AI methods for multimodal analysis of dis-/misinformation. This would require more interdisciplinary research and collaboration to ensure better understanding of the findings originating from the disciplines of humanities and social sciences.
Our meta-analysis has also demonstrated the challenges of conducting multimodal analysis of dis-/misinformation and the nature of the associated gap in research. The gap manifests itself through the absence of a coherent body of multimodal research on disinformation and misinformation. The divide between different disciplines and research interests in the field was present throughout our analysis, including the topic modelling of abstracts, the co-citation analysis and the manual qualitative analysis on the full text of 127 articles (49 full-text articles were analysed at Stage 1 and 78 articles at Stage 2).
With the advent of accessible computing technology, large-scale quantitative analyses constitute a clear new avenue for research into multimodal disinformation and misinformation. Indeed, we observed a recent uptake of this approach; however, efforts to leverage these methods have largely been confined to computer science. This has resulted in many missed research opportunities and has even manifested in experimental design and analysis that is not motivated by theories of multimodal communication. Moving forward, a more unified research landscape is needed, which will require the development of unified terminology and definitions suitable for analysis of multimodal dis-/misinformation, as well as a conscious effort from scholars to cross boundaries of disciplines. Among other things, interdisciplinarity should enable more studies to focus on video data and as a result to examine three modalities (verbal (text), sound, visual) as opposed to just two modalities. Further development of interdisciplinary approaches to the analysis of multimodal dis-/misinformation should also empower researchers to investigate at scale subtle manipulation which forms a large part of dis-/misinformation communication, but is more difficult to research than 'fakes'.

Figure 2. (a) Number of records published each year (1900 to 31 March 2020) within scope of Stage 1. (b) The number of records for each type of document.

Figure 3. The number of records by the same author (a) or records published in the same source (b).

Figure 4. Subject breakdown. The breakdown of subjects for all eligible records (a) and the breakdown into sub-subjects for the records in the social sciences (b).

Figure 5. Methodology. The distribution of subjects for content-specific eligible records (a), and the type of methodology used in these records (b).

Figure 6. The 20 most common words in the abstracts after stop words were removed.

Figure 7. Coherence scores for LDA models for different numbers of topics (minimum 2). The red-highlighted points indicate possible choices of topic numbers.

Figure 9. The frequency with which each topic was the primary topic in an abstract.

Figure 10. This chart compares the number of CS versus non-CS publications across both Stages 1 and 2 using the stricter selection criteria outlined in Stage 2. All papers were screened to check that they were applicable to dis-/misinformation; the earliest example dated from 2015.

Figure 11. This is a general schematic of a bimodal disinformation detector. The blue boxes labelled 1 and 2 are modules that process each modality of the input data. The role of module 3 is to pass information, no matter how complex, between modalities. The outputs of these modules are first aggregated before then being passed into a black box classifier; concatenation and 'multilayer perceptrons' were the most commonly observed ML methods, respectively.

Table 1. Topic modelling results for N = 4 topics.
and propaganda, articles published in the physical sciences, like computer science, tended to focus on social media and use terms like 'fake news'.

Table 2. The tendency for a subject to cite another subject is shown between four subject categories. Specifically, the proportion, for each subject, of citations for the citing journal to the cited journal is shown in the right-most column.

Table 3. The 'intensity' of a country's research into multimodal disinformation was estimated by summing the instances of each participating institution across all papers, and normalizing by the number of citeable documents for that host country [98].